Abstract — We present a set of high-probability inequalities
that control the concentration of weighted averages of multiple 
(possibly uncountably many) simultaneously evolving and inter- 
dependent martingales. Our results extend the PAC-Bayesian 
analysis in learning theory from the i.i.d. setting to martingales,
opening the way for its application to importance-weighted
sampling, reinforcement learning, and other interactive learning
domains, as well as many other domains in probability theory 
and statistics, where martingales are encountered. 

We also present a comparison inequality that bounds the 
expectation of a convex function of a martingale difference 
sequence shifted to the [0, 1] interval by the expectation of the 
same function of independent Bernoulli variables. This inequality 
is applied to derive a tighter analog of Hoeffding-Azuma's 
inequality. 

Index Terms — Martingales, Hoeffding-Azuma's inequality, 
Bernstein's inequality, PAC-Bayesian bounds. 

I. Introduction 

Martingales are one of the fundamental tools in
probability theory and statistics for modeling and
studying sequences of random variables. Some of the most
well-known and widely used concentration inequalities for
individual martingales are Hoeffding-Azuma's and Bernstein's
inequalities [1], [2], [3]. We present a comparison inequality
that bounds the expectation of a convex function of a mar- 
tingale difference sequence shifted to the [0, 1] interval by 
the expectation of the same function of independent Bernoulli 
variables. We apply this inequality in order to derive a tighter 
analog of Hoeffding-Azuma's inequality for martingales. 

More importantly, we present a set of inequalities that make it possible to control weighted averages of multiple simultaneously evolving and interdependent martingales (see Fig. 1 for an illustration). The inequalities are especially interesting when the number of martingales is uncountably infinite and the standard union bound over the individual martingales cannot be applied. The inequalities hold with high probability simultaneously for a large class of averaging laws $\rho$. In particular, $\rho$ can depend on the sample.

One possible application of our inequalities is an analysis of importance-weighted sampling. Importance-weighted sampling is a general and widely used technique for estimating properties of a distribution by drawing samples from a different distribution. Via proper reweighting of the samples, the expectation of the desired statistics based on the reweighted samples from the controlled distribution can be made identical to the expectation of the same statistics based on unweighted samples from the desired distribution. Thus, the difference between the observed statistics and its expected value forms a martingale difference sequence. Our inequalities can be applied in order to control the deviation of the observed statistics from its expected value. Furthermore, since the averaging law $\rho$ can depend on the sample, the controlled distribution can be adapted based on its outcomes from the preceding rounds, for example, for denser sampling in the data-dependent regions of interest. See [4] for an example of an application of this technique in reinforcement learning.

Fig. 1. Illustration of an infinite set of simultaneously evolving and interdependent martingales. $\mathcal{H}$ is a space that indexes the individual martingales. For a fixed point $h \in \mathcal{H}$, the sequence $M_1(h), M_2(h), \dots, M_n(h)$ is a single martingale. The arrows represent the dependencies between the values of the martingales: the value of a martingale $h$ at time $i$, denoted by $M_i(h)$, depends on $M_j(h')$ for all $j \le i$ and $h' \in \mathcal{H}$ (everything that is "before" and "concurrent" with $M_i(h)$ in time; some of the arrows are omitted for clarity). A mean value of the martingales with respect to a probability distribution $\rho$ over $\mathcal{H}$ is given by $\langle M_n, \rho \rangle$. Our high-probability inequalities bound $|\langle M_n, \rho \rangle|$ simultaneously for a large class of $\rho$.

Our concentration inequalities for weighted averages of martingales are based on a combination of Donsker-Varadhan's variational formula for relative entropy [5], [6], [7] with bounds on certain moment generating functions of martingales, including Hoeffding-Azuma's and Bernstein's inequalities, as well as the new inequality derived in this paper.

In a nutshell, Donsker-Varadhan's variational formula implies that for a probability space $(\mathcal{H}, \mathcal{B})$, a bounded real-valued random variable $\Phi$ and any two probability distributions $\pi$ and $\rho$ over $\mathcal{H}$ (or, if $\mathcal{H}$ is uncountably infinite, two probability density functions), the expected value $\mathbb{E}_\rho[\Phi]$ is bounded as:

$$\mathbb{E}_\rho[\Phi] \le \mathrm{KL}(\rho\|\pi) + \ln \mathbb{E}_\pi\left[e^{\Phi}\right], \tag{1}$$

where $\mathrm{KL}(\rho\|\pi)$ is the KL-divergence (relative entropy) between two distributions [8]. We can also think of $\Phi$ as $\Phi = \phi(h)$, where $\phi(h)$ is a measurable function $\phi : \mathcal{H} \to \mathbb{R}$. Inequality (1) can then be written using the dot-product notation

$$\langle \phi, \rho \rangle \le \mathrm{KL}(\rho\|\pi) + \ln \langle e^{\phi}, \pi \rangle \tag{2}$$

and $\mathbb{E}_\rho[\phi] = \langle \phi, \rho \rangle$ can be thought of as a weighted average of $\phi$ with respect to $\rho$ (for countable $\mathcal{H}$ it is defined as $\langle \phi, \rho \rangle = \sum_{h \in \mathcal{H}} \phi(h)\rho(h)$ and for uncountable $\mathcal{H}$ it is defined as $\langle \phi, \rho \rangle = \int_{\mathcal{H}} \phi(h)\rho(h)\,dh$).$^1$

The weighted averages $\langle \phi, \rho \rangle$ on the left hand side of (2) are the quantities of interest and the inequality allows us to relate all possible averaging laws $\rho$ to a single "reference" distribution $\pi$. (Sometimes, $\pi$ is also called a "prior" distribution, since it has to be selected before observing the sample.) We emphasize that inequality (2) is a deterministic relation. Thus, by a single application of Markov's inequality to $\langle e^{\phi}, \pi \rangle$ we obtain a statement that holds with high probability for all $\rho$ simultaneously. The quantity $\ln \langle e^{\phi}, \pi \rangle$, known as the cumulant-generating function of $\phi$, is closely related to the moment-generating function of $\phi$. The bound on $\ln \langle e^{\phi}, \pi \rangle$, after some manipulations, is achieved via the bounds on moment-generating functions, which are identical to those used in the proofs of Hoeffding-Azuma's, Bernstein's, or our new inequality, depending on the choice of $\phi$.
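The deterministic nature of inequality (2) can be checked numerically on a finite index space. The following is a minimal sketch, assuming a finite $\mathcal{H}$ with 20 points, a uniform reference distribution, and randomly drawn $\phi$ and $\rho$ (all of these choices, and the function names, are illustrative assumptions and not part of the paper). It also checks that the inequality becomes an equality when $\rho(h) \propto \pi(h) e^{\phi(h)}$, which is the same as $\phi(h) = \ln(\rho(h)/\pi(h))$ up to an additive constant.

```python
import math
import random

def kl_divergence(rho, pi):
    """KL(rho || pi) for finite distributions given as lists (pi assumed positive)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def change_of_measure_gap(phi, rho, pi):
    """Right-hand side of (2) minus its left-hand side; (2) says this is >= 0."""
    lhs = sum(f * r for f, r in zip(phi, rho))
    log_mgf = math.log(sum(math.exp(f) * p for f, p in zip(phi, pi)))
    return kl_divergence(rho, pi) + log_mgf - lhs

def random_distribution(m, rng):
    """A random point of the probability simplex (illustrative sampler)."""
    weights = [rng.random() + 1e-12 for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
m = 20
phi = [rng.uniform(-2.0, 2.0) for _ in range(m)]
pi = [1.0 / m] * m

# The same pi and phi are used for every rho: the relation holds for all rho at once.
gaps = [change_of_measure_gap(phi, random_distribution(m, rng), pi) for _ in range(1000)]
smallest_gap = min(gaps)

# rho proportional to pi * exp(phi) attains equality in (2).
unnormalized = [p * math.exp(f) for f, p in zip(phi, pi)]
z = sum(unnormalized)
gibbs = [u / z for u in unnormalized]
gibbs_gap = change_of_measure_gap(phi, gibbs, pi)
```

In this sketch `smallest_gap` stays non-negative over all sampled $\rho$, while `gibbs_gap` is zero up to floating-point error.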

Donsker-Varadhan's variational formula for relative entropy laid the basis for PAC-Bayesian analysis in statistical learning theory [9], [10], [11], [12], where PAC is an abbreviation for the Probably Approximately Correct learning model introduced by Valiant [13]. PAC-Bayesian analysis provides high-probability bounds on the deviation of weighted averages of empirical means of sets of independent random variables from their expectations. In the learning theory setting, the space $\mathcal{H}$ usually corresponds to a hypothesis space; the function $\phi(h)$ is related to the difference between the expected and empirical error of a hypothesis $h$; the distribution $\pi$ is a prior distribution over the hypothesis space; and the distribution $\rho$ defines a randomized classifier. The randomized classifier draws a hypothesis $h$ from $\mathcal{H}$ according to $\rho$ at each round of the game and applies it to make the prediction on the next sample. PAC-Bayesian analysis supplied generalization guarantees for many influential machine learning algorithms, including support vector machines [14], [15], linear classifiers [16], and clustering-based models [17], to name just a few of them.

We show that PAC-Bayesian analysis can be extended to martingales. A combination of PAC-Bayesian analysis with Hoeffding-Azuma's inequality was applied by Lever et al. [18] in the analysis of U-statistics. The results presented here are both tighter and more general, and make it possible to apply PAC-Bayesian analysis in new domains, such as, for example, reinforcement learning [4].

$^1$The complete statement of Donsker-Varadhan's variational formula for relative entropy states that under appropriate conditions $\mathrm{KL}(\rho\|\pi) = \sup_{\phi} \left( \langle \phi, \rho \rangle - \ln \langle e^{\phi}, \pi \rangle \right)$, where the supremum is achieved by $\phi(h) = \ln \frac{\rho(h)}{\pi(h)}$. However, in our case the choice of $\phi$ is directly related to the values of the martingales of interest and the free parameters in the inequality are the choices of $\rho$ and $\pi$. Therefore, we are looking at the inequality in the form of equation (2) and a more appropriate name for it is "change of measure inequality".

II. Main Results 

We first present our new inequalities for individual martin- 
gales, and then present the inequalities for weighted averages 
of martingales. All the proofs are provided in the appendix. 

A. Inequalities for Individual Martingales 

Our first lemma is a comparison inequality that bounds expectations of convex functions of martingale difference sequences shifted to the [0, 1] interval by expectations of the same functions of independent Bernoulli random variables. The lemma generalizes a previous result by Maurer for independent random variables [19]. The lemma uses the following notation: for a sequence of random variables $X_1, \dots, X_n$ we use $X_1^i := X_1, \dots, X_i$ to denote the first $i$ elements of the sequence.

Lemma 1: Let $X_1, \dots, X_n$ be a sequence of random variables, such that $X_i \in [0, 1]$ with probability 1 and $\mathbb{E}[X_i | X_1^{i-1}] = b_i$ for $i = 1, \dots, n$. Let $Y_1, \dots, Y_n$ be independent Bernoulli random variables, such that $\mathbb{E}[Y_i] = b_i$. Then for any convex function $f : [0, 1]^n \to \mathbb{R}$:

$$\mathbb{E}[f(X_1, \dots, X_n)] \le \mathbb{E}[f(Y_1, \dots, Y_n)].$$

Let $\mathrm{kl}(p\|q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q}$ be an abbreviation for $\mathrm{KL}([p, 1 - p] \,\|\, [q, 1 - q])$, where $[p, 1 - p]$ and $[q, 1 - q]$ are Bernoulli distributions with biases $p$ and $q$, respectively. By Pinsker's inequality [8],

$$|p - q| \le \sqrt{\mathrm{kl}(p\|q)/2},$$

which means that a bound on $\mathrm{kl}(p\|q)$ implies a bound on the absolute difference between the biases of the Bernoulli distributions.
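As a small illustration of the definition and of Pinsker's inequality, the sketch below evaluates $\mathrm{kl}(p\|q)$ and compares $|p - q|$ with $\sqrt{\mathrm{kl}(p\|q)/2}$ on a grid of biases. The function names and the grid are illustrative assumptions; the convention $0 \ln 0 = 0$ is used at the boundary.

```python
import math

def kl_bernoulli(p, q):
    """kl(p || q) between Bernoulli distributions with biases p and q."""
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("biases must lie in [0, 1]")
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:            # terms with a == 0 contribute 0 by convention
            if b <= 0.0:       # mass where q has none: divergence is infinite
                return math.inf
            total += a * math.log(a / b)
    return total

def pinsker_bound(kl_value):
    """Upper bound on |p - q| implied by a value of kl(p || q)."""
    return math.sqrt(kl_value / 2.0)

# Slack of Pinsker's inequality on a grid of biases; it is never negative.
grid = [i / 50 for i in range(1, 50)]
smallest_slack = min(
    pinsker_bound(kl_bernoulli(p, q)) - abs(p - q) for p in grid for q in grid
)

# Near the boundary the slack is large, which is why the kl form can be tighter.
slack_near_zero = pinsker_bound(kl_bernoulli(0.01, 0.05)) - abs(0.01 - 0.05)
```

The slack is smallest around $p \approx q \approx 1/2$ and grows when the biases approach zero or one.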

We apply Lemma 1 in order to derive the following inequality, which is an interesting generalization of an analogous result for i.i.d. variables. The result is based on the method of types in information theory [8].

Lemma 2: Let $X_1, \dots, X_n$ be a sequence of random variables, such that $X_i \in [0, 1]$ with probability 1 and $\mathbb{E}[X_i | X_1^{i-1}] = b$ for $i = 1, \dots, n$. Let $S_n := \sum_{i=1}^{n} X_i$. Then:

$$\mathbb{E}\left[e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)}\right] \le n + 1. \tag{3}$$

Note that in Lemma 2 the conditional expectation $\mathbb{E}[X_i | X_1^{i-1}]$ is identical for all $i$, whereas in Lemma 1 there is no such restriction. Combination of Lemma 2 with Markov's inequality leads to the following analog of Hoeffding-Azuma's inequality.

Corollary 3: Let $X_1, \dots, X_n$ be as in Lemma 2. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$:

$$\mathrm{kl}\left(\frac{1}{n} S_n \,\Big\|\, b\right) \le \frac{1}{n} \ln \frac{n + 1}{\delta}. \tag{4}$$

$S_n$ is a terminal point of a random walk with bias $b$ after $n$ steps. By combining Corollary 3 with Pinsker's inequality we can obtain a more explicit bound on the deviation of the terminal point from its expected value, $|S_n - bn| \le \sqrt{\frac{n}{2} \ln \frac{n + 1}{\delta}}$, which is similar to the result we can obtain by applying Hoeffding-Azuma's inequality. However, in certain situations the less explicit bound in the form of kl is significantly tighter than Hoeffding-Azuma's inequality and it can also be tighter than Bernstein's inequality. A detailed comparison is provided in Section III.

B. PAC-Bayesian Inequalities for Weighted Averages of Martingales

Next, we present several inequalities that control the concentration of weighted averages of multiple simultaneously evolving and interdependent martingales. The first result shows that the classical PAC-Bayesian theorem for independent random variables [12] holds in the same form for martingales. The result is based on a combination of Donsker-Varadhan's variational formula for relative entropy with Lemma 2. In order to state the theorem we need a few definitions.

Let $(\mathcal{H}, \mathcal{B})$ be a probability space. Let $X_1, \dots, X_n$ be a sequence of random functions, such that $X_i : \mathcal{H} \to [0, 1]$. Assume that $\mathbb{E}[X_i | X_1, \dots, X_{i-1}] = b$, where $b : \mathcal{H} \to [0, 1]$ is a deterministic function (possibly unknown). This means that $\mathbb{E}[X_i(h) | X_1, \dots, X_{i-1}] = b(h)$ for each $i$ and $h$. Note that for each $h \in \mathcal{H}$ the sequence $X_1(h), \dots, X_n(h)$ satisfies the condition of Lemma 2.

Let $S_n := \sum_{i=1}^{n} X_i$. In the following theorem we are bounding the mean of $S_n$ with respect to any probability measure $\rho$ over $\mathcal{H}$.

Theorem 4 (PAC-Bayes-kl Inequality): Fix a reference distribution $\pi$ over $\mathcal{H}$. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over $X_1, \dots, X_n$, for all distributions $\rho$ over $\mathcal{H}$ simultaneously:

$$\mathrm{kl}\left(\left\langle \frac{1}{n} S_n, \rho \right\rangle \,\Big\|\, \langle b, \rho \rangle\right) \le \frac{\mathrm{KL}(\rho\|\pi) + \ln \frac{n + 1}{\delta}}{n}. \tag{5}$$

By Pinsker's inequality, Theorem 4 implies that

$$\left| \left\langle \frac{1}{n} S_n, \rho \right\rangle - \langle b, \rho \rangle \right| = \left| \left\langle \frac{1}{n} S_n - b, \rho \right\rangle \right| \le \sqrt{\frac{\mathrm{KL}(\rho\|\pi) + \ln \frac{n + 1}{\delta}}{2n}}; \tag{6}$$

however, if $\left\langle \frac{1}{n} S_n, \rho \right\rangle$ is close to zero or one, inequality (5) is significantly tighter than (6).
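To make the role of the KL term concrete, the following sketch evaluates the right-hand sides of (5) and (6) for a finite hypothesis space. The four-point space, the two example posteriors, and the values of $n$ and $\delta$ are hypothetical and chosen only for illustration.

```python
import math

def kl_divergence(rho, pi):
    """KL(rho || pi) for finite distributions given as lists (pi assumed positive)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def pac_bayes_kl_rhs(rho, pi, n, delta):
    """Right-hand side of (5): (KL(rho || pi) + ln((n + 1) / delta)) / n."""
    return (kl_divergence(rho, pi) + math.log((n + 1) / delta)) / n

def pinsker_relaxation(rho, pi, n, delta):
    """Right-hand side of (6), obtained from (5) by Pinsker's inequality."""
    return math.sqrt(pac_bayes_kl_rhs(rho, pi, n, delta) / 2.0)

# Hypothetical example: uniform reference distribution over four hypotheses.
pi = [0.25, 0.25, 0.25, 0.25]
rho_equal_prior = list(pi)                    # KL term vanishes
rho_concentrated = [0.97, 0.01, 0.01, 0.01]   # pays a positive KL price
n, delta = 1000, 0.05

for name, rho in (("rho = pi", rho_equal_prior), ("concentrated rho", rho_concentrated)):
    print(name, pac_bayes_kl_rhs(rho, pi, n, delta), pinsker_relaxation(rho, pi, n, delta))
```

Both bounds hold for the two posteriors simultaneously with the same $\delta$; the only difference between them is the value of $\mathrm{KL}(\rho\|\pi)$.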

The next result is based on a combination of Donsker-Varadhan's variational formula for relative entropy with Hoeffding-Azuma's inequality. This time let $Z_1, \dots, Z_n$ be a sequence of random functions, such that $Z_i : \mathcal{H} \to \mathbb{R}$. Let $Z_1^i$ be an abbreviation for a subsequence of the first $i$ random functions in the sequence. We assume that $\mathbb{E}[Z_i | Z_1^{i-1}] = 0$. In other words, for each $h \in \mathcal{H}$ the sequence $Z_1(h), \dots, Z_n(h)$ is a martingale difference sequence.

Let $M_i := \sum_{j=1}^{i} Z_j$. Then, for each $h \in \mathcal{H}$ the sequence $M_1(h), \dots, M_n(h)$ is a martingale. In the following theorems we bound the mean of $M_n$ with respect to any probability measure $\rho$ on $\mathcal{H}$.

Theorem 5: Assume that $Z_i : \mathcal{H} \to [\alpha_i, \beta_i]$. Fix a reference distribution $\pi$ over $\mathcal{H}$ and $\lambda > 0$. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over $Z_1^n$, for all distributions $\rho$ over $\mathcal{H}$ simultaneously:

$$|\langle M_n, \rho \rangle| \le \frac{\mathrm{KL}(\rho\|\pi) + \ln \frac{2}{\delta}}{\lambda} + \frac{\lambda}{8} \sum_{i=1}^{n} (\beta_i - \alpha_i)^2. \tag{7}$$



We note that we cannot minimize inequality (7) simultaneously for all $\rho$ by a single value of $\lambda$. In the following theorem we take a grid of $\lambda$-s in the form of a geometric sequence and for each value of $\mathrm{KL}(\rho\|\pi)$ we pick a value of $\lambda$ from the grid, which is the closest to the one that minimizes (7). The result is almost as good as what we could achieve if we minimized the bound just for a single value of $\rho$.
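The dependence of the minimizing $\lambda$ on $\mathrm{KL}(\rho\|\pi)$ can be seen directly from the form of (7). The sketch below assumes the bound has the form $U/\lambda + \lambda V/8$ with $U = \mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}$ and $V = \sum_i (\beta_i - \alpha_i)^2$, so that the minimizer is $\sqrt{8U/V}$; the function names and the numerical values are illustrative.

```python
import math

def bound_7(kl, lam, ranges, delta):
    """Right-hand side of (7); ranges[i] stands for beta_i - alpha_i."""
    v = sum(r * r for r in ranges)
    return (kl + math.log(2.0 / delta)) / lam + lam * v / 8.0

def minimizing_lambda(kl, ranges, delta):
    """Value of lambda that minimizes (7) for a fixed KL(rho || pi)."""
    v = sum(r * r for r in ranges)
    return math.sqrt(8.0 * (kl + math.log(2.0 / delta)) / v)

ranges = [1.0] * 1000      # hypothetical: n = 1000 increments, each with range 1
delta = 0.05

lam_small_kl = minimizing_lambda(0.0, ranges, delta)    # best lambda when rho = pi
lam_large_kl = minimizing_lambda(10.0, ranges, delta)   # best lambda when KL = 10

# Using the lambda tuned for KL = 0 at KL = 10 gives a looser bound.
tuned = bound_7(10.0, lam_large_kl, ranges, delta)
mistuned = bound_7(10.0, lam_small_kl, ranges, delta)
```

Since a single $\lambda$ cannot be optimal for all $\rho$, the geometric grid trades a small constant factor for uniformity over $\rho$.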

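Theorem 6 (PAC-Bayes-Hoeffding-Azuma Inequality): Assume that $Z_1^n$ is as in Theorem 5. Fix a reference distribution $\pi$ over $\mathcal{H}$. Take an arbitrary number $c > 1$. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over $Z_1^n$, for all distributions $\rho$ over $\mathcal{H}$ simultaneously:

$$|\langle M_n, \rho \rangle| \le \frac{1 + c}{2\sqrt{2}} \sqrt{\left( \mathrm{KL}(\rho\|\pi) + \ln \frac{2}{\delta} + \epsilon(\rho) \right) \sum_{i=1}^{n} (\beta_i - \alpha_i)^2}, \tag{8}$$

where

$$\epsilon(\rho) = \frac{\ln 2}{2 \ln c} \left( 1 + \ln \left( \frac{\mathrm{KL}(\rho\|\pi)}{\ln \frac{2}{\delta}} \right) \right).$$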

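Our last result is based on a combination of Donsker-Varadhan's variational formula with a Bernstein-type inequality for martingales. Let $V_i : \mathcal{H} \to \mathbb{R}$ be such that $V_i(h) := \sum_{j=1}^{i} \mathbb{E}\left[ Z_j(h)^2 \,\big|\, Z_1^{j-1} \right]$. In other words, $V_i(h)$ is the variance of the martingale $M_i(h)$ defined earlier. Let $\|Z_i\|_\infty = \sup_{h \in \mathcal{H}} |Z_i(h)|$ be the $L_\infty$ norm of $Z_i$.

Theorem 7: Assume that $\|Z_i\|_\infty \le K$ for all $i$ with probability 1 and pick $\lambda$, such that $\lambda \le 1/K$. Fix a reference distribution $\pi$ over $\mathcal{H}$. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over $Z_1^n$, for all distributions $\rho$ over $\mathcal{H}$ simultaneously:

$$|\langle M_n, \rho \rangle| \le \frac{\mathrm{KL}(\rho\|\pi) + \ln \frac{2}{\delta}}{\lambda} + (e - 2) \lambda \langle V_n, \rho \rangle. \tag{9}$$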



As in the previous case, the right hand side of (9) cannot be minimized for all $\rho$ simultaneously by a single value of $\lambda$. Furthermore, $V_n$ is a random function. In the following theorem we take a similar grid of $\lambda$-s, as we did in Theorem 6, and a union bound over the grid. Picking a value of $\lambda$ from the grid closest to the value of $\lambda$ that minimizes the right hand side of (9) yields almost as good a result as we would get by minimizing (9) for a single choice of $\rho$. In this approach the variance $V_n$ can be replaced by a sample-dependent upper bound. For example, in importance-weighted sampling such an upper bound is derived from the reciprocal of the sampling distribution at each round [4].

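Theorem 8 (PAC-Bayes-Bernstein Inequality): Assume that $\|Z_i\|_\infty \le K$ for all $i$ with probability 1. Fix a reference distribution $\pi$ over $\mathcal{H}$. Pick an arbitrary number $c > 1$. Then, for any $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over $Z_1^n$, simultaneously for all distributions $\rho$ over $\mathcal{H}$ that satisfy

$$\sqrt{\frac{\mathrm{KL}(\rho\|\pi) + \ln \frac{2\nu}{\delta}}{(e - 2) \langle V_n, \rho \rangle}} \le \frac{1}{K} \tag{10}$$

we have

$$|\langle M_n, \rho \rangle| \le (1 + c) \sqrt{(e - 2) \langle V_n, \rho \rangle \left( \mathrm{KL}(\rho\|\pi) + \ln \frac{2\nu}{\delta} \right)}, \tag{11}$$

where

$$\nu = \left\lceil \frac{\ln \left( \sqrt{\frac{(e - 2) n}{\ln \frac{2}{\delta}}} \right)}{\ln c} \right\rceil + 1, \tag{12}$$

and for all other $\rho$

$$|\langle M_n, \rho \rangle| \le 2K \left( \mathrm{KL}(\rho\|\pi) + \ln \frac{2\nu}{\delta} \right). \tag{13}$$

($\lceil x \rceil$ is the smallest integer value that is larger than $x$.)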

III. Comparison of the Inequalities 

In this section we remind the reader of Hoeffding-Azuma's 
and Bernstein's inequalities for individual martingales and 
compare them with our new kl-form inequality. Then, we 
compare inequalities for weighted averages of martingales 
with inequalities for individual martingales. 

A. Background 

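We first recall Hoeffding-Azuma's inequality [1], [2]. For a sequence of random variables $Z_1, \dots, Z_n$ we use $Z_1^i := Z_1, \dots, Z_i$ to denote the first $i$ elements of the sequence.

Lemma 9 (Hoeffding-Azuma's Inequality): Let $Z_1, \dots, Z_n$ be a martingale difference sequence, such that $Z_i \in [\alpha_i, \beta_i]$ with probability 1 and $\mathbb{E}[Z_i | Z_1^{i-1}] = 0$. Let $M_i := \sum_{j=1}^{i} Z_j$ be the corresponding martingale. Then for any $\lambda \in \mathbb{R}$:

$$\mathbb{E}\left[e^{\lambda M_n}\right] \le e^{\frac{\lambda^2}{8} \sum_{i=1}^{n} (\beta_i - \alpha_i)^2}.$$

By combining Hoeffding-Azuma's inequality with Markov's inequality and taking $\lambda = \sqrt{\frac{8 \ln \frac{2}{\delta}}{\sum_{i=1}^{n} (\beta_i - \alpha_i)^2}}$ we obtain the following corollary.

Corollary 10: For $M_n$ defined in Lemma 9 and $\delta \in (0, 1)$, with probability greater than $1 - \delta$:

$$|M_n| \le \sqrt{\frac{1}{2} \ln \left( \frac{2}{\delta} \right) \sum_{i=1}^{n} (\beta_i - \alpha_i)^2}.$$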

The next lemma is a Bernstein-type inequality [3], [20]. We provide the proof of this inequality in Appendix C; the proof is a part of the proof of [21, Theorem 1].

Lemma 11 (Bernstein's Inequality): Let $Z_1, \dots, Z_n$ be a martingale difference sequence, such that $|Z_i| \le K$ with probability 1 and $\mathbb{E}[Z_i | Z_1^{i-1}] = 0$. Let $M_n := \sum_{i=1}^{n} Z_i$ and $V_n := \sum_{i=1}^{n} \mathbb{E}\left[(Z_i)^2 \,\big|\, Z_1^{i-1}\right]$. Then for any $\lambda \in [0, \frac{1}{K}]$:

$$\mathbb{E}\left[e^{\lambda M_n - (e - 2) \lambda^2 V_n}\right] \le 1.$$

By combining Lemma 11 with Markov's inequality we obtain that for any $\lambda \in [0, \frac{1}{K}]$ and $\delta \in (0, 1)$, with probability greater than $1 - \delta$:

$$|M_n| \le \frac{1}{\lambda} \ln \frac{2}{\delta} + \lambda (e - 2) V_n. \tag{14}$$

$V_n$ is a random variable and can be replaced by an upper bound. Inequality (14) is minimized by $\lambda^* = \sqrt{\frac{\ln \frac{2}{\delta}}{(e - 2) V_n}}$. Note that $\lambda^*$ depends on $V_n$ and is not accessible until we observe the entire sample. We can bypass this problem by constructing the same grid of $\lambda$-s as the one used in the proof of Theorem 8 and taking a union bound over it. Picking a value of $\lambda$ closest to $\lambda^*$ from the grid leads to the following corollary. In this bounding technique the upper bound on $V_n$ can be sample-dependent, since the bound holds simultaneously for all $\lambda$-s in the grid. Despite being a relatively simple consequence of Lemma 11, we have not seen this result in the literature. The corollary is tighter than an analogous result by Beygelzimer et al. [21, Theorem 1].

Corollary 12: For $M_n$ and $V_n$ as defined in Lemma 11, $c > 1$ and $\delta \in (0, 1)$, with probability greater than $1 - \delta$, if

$$\sqrt{\frac{\ln \frac{2\nu}{\delta}}{(e - 2) V_n}} \le \frac{1}{K} \tag{15}$$

then

$$|M_n| \le (1 + c) \sqrt{(e - 2) V_n \ln \frac{2\nu}{\delta}},$$

where $\nu$ is defined in (12), and otherwise

$$|M_n| \le 2K \ln \frac{2\nu}{\delta}.$$

The technical condition (15) follows from the requirement of Lemma 11 that $\lambda \in [0, \frac{1}{K}]$.

B. Comparison 

We first compare inequalities for individual martingales in Corollaries 3, 10, and 12.

Comparison of Inequalities for Individual Martingales: The comparison between Corollaries 10 and 12 is relatively straightforward. We note that the assumption $\mathbb{E}[Z_i | Z_1^{i-1}] = 0$ implies that $\alpha_i \le 0$ and that $V_n \le \sum_{i=1}^{n} \max\{\alpha_i^2, \beta_i^2\} \le \sum_{i=1}^{n} (\beta_i - \alpha_i)^2$. Hence, Corollary 12 (derived from Bernstein's inequality) matches Corollary 10 (derived from Hoeffding-Azuma's inequality) up to minor constants and logarithmic factors in the general case, and can be much tighter when the variance is small.

The comparison with the kl inequality in Corollary 3 is a bit more involved. As we mentioned after Corollary 3, its combination with Pinsker's inequality implies that $|S_n - bn| \le \sqrt{\frac{n}{2} \ln \frac{n + 1}{\delta}}$, where $S_n - bn$ is a martingale corresponding to the martingale difference sequence $Z_i = X_i - b$. Thus, Corollary 3 is at least as tight as Hoeffding-Azuma's inequality in Corollary 10 up to a factor of $\sqrt{\ln \frac{n + 1}{\delta} \big/ \ln \frac{2}{\delta}}$. This is also true if $X_i \in [\alpha_i, \beta_i]$ (rather than [0, 1]), as long as we can simultaneously project all $X_i$-s to the [0, 1] interval without losing too much.
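The relation between Corollaries 10 and 12 can be illustrated numerically. The sketch below is only an illustration under assumed values ($n = 10000$, $K = 1$, increments in $[-1, 1]$, $\delta = 0.05$, $c = 1.1$); the function names are hypothetical, and `v_n` stands for $V_n$ or any upper bound on it.

```python
import math

def hoeffding_azuma_bound(ranges, delta):
    """Corollary 10: sqrt(0.5 * ln(2/delta) * sum_i (beta_i - alpha_i)^2)."""
    return math.sqrt(0.5 * math.log(2.0 / delta) * sum(r * r for r in ranges))

def bernstein_bound(v_n, n, K, delta, c):
    """Corollary 12, with nu computed as in (12)."""
    ratio = math.sqrt((math.e - 2.0) * n / math.log(2.0 / delta))
    nu = math.ceil(math.log(ratio) / math.log(c)) + 1
    log_term = math.log(2.0 * nu / delta)
    # Technical condition (15); when it fails the second bound of the corollary applies.
    if v_n > 0.0 and math.sqrt(log_term / ((math.e - 2.0) * v_n)) <= 1.0 / K:
        return (1.0 + c) * math.sqrt((math.e - 2.0) * v_n * log_term)
    return 2.0 * K * log_term

n, K, delta, c = 10000, 1.0, 0.05, 1.1
ranges = [2.0 * K] * n                     # Z_i in [-K, K]

ha = hoeffding_azuma_bound(ranges, delta)
bern_small_var = bernstein_bound(0.01 * n, n, K, delta, c)   # small variance
bern_max_var = bernstein_bound(K * K * n, n, K, delta, c)    # worst-case variance
```

With a small variance the Bernstein-type bound is several times tighter than `ha`, while with the worst-case variance it is larger than `ha` only by a moderate constant factor.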
Tighter upper bounds on the kl divergence show that in certain situations Corollary 3 is actually much tighter than Hoeffding-Azuma's inequality. One possible application of Corollary 3 is estimation of the value of the drift $b$ of a random walk from empirical observation $S_n$. If $S_n$ is close to zero, it is possible to use a tighter bound on kl, which states that for $p > q$ we have $p \le q + \sqrt{2q\,\mathrm{kl}(q\|p)} + 2\,\mathrm{kl}(q\|p)$ [15]. From this inequality, we obtain that with probability greater than $1 - \delta$:

$$b \le \frac{1}{n} S_n + \sqrt{\frac{2 \cdot \frac{1}{n} S_n \ln \frac{n + 1}{\delta}}{n}} + \frac{2 \ln \frac{n + 1}{\delta}}{n}.$$

The above inequality is tighter than Hoeffding-Azuma's inequality whenever $\frac{1}{n} S_n < 1/8$. Since kl is convex in each of its parameters, it is actually easy to invert it numerically, and thus avoid the need to resort to approximations in practice. In a similar manner, tighter bounds can be obtained when $S_n$ is close to $n$.
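One simple way to carry out the numerical inversion mentioned above is bisection, since $\mathrm{kl}(\hat{p}\|q)$ is non-decreasing in $q$ for $q \ge \hat{p}$. The sketch below is one possible implementation rather than a procedure specified in the paper; the function names, the tolerance, and the example values $n = 1000$, $S_n = 10$, $\delta = 0.05$ are assumptions.

```python
import math

def kl_bernoulli(p, q):
    """kl(p || q) between Bernoulli distributions, with 0 * ln(0) = 0."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            if b <= 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total

def kl_upper_inverse(p_hat, bound, tol=1e-12):
    """Largest q in [p_hat, 1] (up to tol) such that kl(p_hat || q) <= bound."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_hat, mid) > bound:
            hi = mid
        else:
            lo = mid
    return lo

n, s_n, delta = 1000, 10.0, 0.05
p_hat = s_n / n
bound = math.log((n + 1) / delta) / n          # right-hand side of (4)

b_upper_kl = kl_upper_inverse(p_hat, bound)     # numerical inversion of (4)
b_upper_explicit = p_hat + math.sqrt(2.0 * p_hat * bound) + 2.0 * bound
b_upper_hoeffding = p_hat + math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

In this example the empirical drift is close to zero, and the three upper bounds on $b$ are ordered as `b_upper_kl <= b_upper_explicit < b_upper_hoeffding`.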

The comparison of the kl inequality in Corollary 3 with Bernstein's inequality in Corollary 12 is not as clear-cut as the comparison with Hoeffding-Azuma's inequality. If there is a bound on $V_n$ that is significantly tighter than $n$, Bernstein's inequality can be significantly tighter than the kl inequality, but otherwise it can also be the opposite case. In the example of estimating a drift of a random walk without prior knowledge on its variance, if the empirical drift is close to zero or to $n$ the kl inequality is tighter. In this case the kl inequality is comparable with empirical Bernstein's bounds [22], [23], [24].

Comparison of Inequalities for Individual Martingales with PAC-Bayesian Inequalities for Weighted Averages of Martingales: The "price" that is paid for considering weighted averages of multiple martingales is the KL-divergence $\mathrm{KL}(\rho\|\pi)$ between the desired mixture weights $\rho$ and the reference mixture weights $\pi$. (In the case of the PAC-Bayes-Hoeffding-Azuma inequality, Theorem 6, there is also an additional minor term originating from the union bound over the grid of $\lambda$-s.) Note that for $\rho = \pi$ the KL term vanishes.

IV. Discussion 

We presented a comparison inequality that bounds the expectation of a convex function of martingale difference type variables by the expectation of the same function of independent Bernoulli variables. This inequality makes it possible to reduce a problem of studying continuous dependent random variables on a bounded interval to a much simpler problem of studying independent Bernoulli random variables.

As an example of an application of our lemma we derived an analog of Hoeffding-Azuma's inequality for martingales. Our result is always comparable to Hoeffding-Azuma's inequality up to a logarithmic factor and, in cases where the empirical drift of a corresponding random walk is close to the region boundaries, it is tighter than Hoeffding-Azuma's inequality by an order of magnitude. It can also be tighter than Bernstein's inequality for martingales, unless there is a tight bound on the martingale variance.

Finally, but most importantly, we presented a set of inequalities on concentration of weighted averages of multiple simultaneously evolving and interdependent martingales. These inequalities are especially useful for controlling uncountably many martingales, where standard union bounds cannot be applied. Martingales are one of the most basic and important tools for studying time-evolving processes and we believe that our results will be useful for multiple domains. One such application in analysis of importance-weighted sampling in reinforcement learning was already presented in [4].

Appendix A 

Proofs of the Results for Individual Martingales 

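Proof of Lemma 1: The proof follows the lines of the proof of Maurer [19, Lemma 3]. Any point $\bar{x} = (x_1, \dots, x_n) \in [0, 1]^n$ can be written as a convex combination of the extreme points $\bar{\eta} = (\eta_1, \dots, \eta_n) \in \{0, 1\}^n$ in the following way:

$$\bar{x} = \sum_{\bar{\eta} \in \{0,1\}^n} \left( \prod_{i=1}^{n} \left[ (1 - x_i)(1 - \eta_i) + x_i \eta_i \right] \right) \bar{\eta}.$$

Convexity of $f$ therefore implies

$$f(\bar{x}) \le \sum_{\bar{\eta} \in \{0,1\}^n} \left( \prod_{i=1}^{n} \left[ (1 - x_i)(1 - \eta_i) + x_i \eta_i \right] \right) f(\bar{\eta}) \tag{16}$$

with equality if $\bar{x} \in \{0, 1\}^n$. Let $X_1^i := X_1, \dots, X_i$ be the first $i$ elements of the sequence $X_1, \dots, X_n$. Let $W_i(\eta_i) = (1 - X_i)(1 - \eta_i) + X_i \eta_i$ and let $w_i(\eta_i) = (1 - b_i)(1 - \eta_i) + b_i \eta_i$. Note that by the assumption of the lemma:

$$\mathbb{E}[W_i(\eta_i) | X_1^{i-1}] = \mathbb{E}[(1 - X_i)(1 - \eta_i) + X_i \eta_i | X_1^{i-1}] = (1 - b_i)(1 - \eta_i) + b_i \eta_i = w_i(\eta_i).$$

By taking expectation of both sides of (16) we obtain:

$$\begin{aligned}
\mathbb{E}_{X_1^n}[f(X_1^n)] &\le \mathbb{E}_{X_1^n}\left[ \sum_{\bar{\eta} \in \{0,1\}^n} \left( \prod_{i=1}^{n} W_i(\eta_i) \right) f(\bar{\eta}) \right] \\
&= \sum_{\bar{\eta} \in \{0,1\}^n} \mathbb{E}_{X_1^n}\left[ \prod_{i=1}^{n} W_i(\eta_i) \right] f(\bar{\eta}) \\
&= \sum_{\bar{\eta} \in \{0,1\}^n} \mathbb{E}_{X_1^{n-1}}\left[ \mathbb{E}_{X_n}\left[ \prod_{i=1}^{n} W_i(\eta_i) \,\Big|\, X_1^{n-1} \right] \right] f(\bar{\eta}) \\
&= \sum_{\bar{\eta} \in \{0,1\}^n} \mathbb{E}_{X_1^{n-1}}\left[ \prod_{i=1}^{n-1} W_i(\eta_i) \right] w_n(\eta_n) f(\bar{\eta}) \\
&= \dots \qquad\qquad (17) \\
&= \sum_{\bar{\eta} \in \{0,1\}^n} \left( \prod_{i=1}^{n} w_i(\eta_i) \right) f(\bar{\eta}) \\
&= \mathbb{E}_{Y_1^n}[f(Y_1^n)].
\end{aligned}$$

In (17) we apply induction in order to replace $X_i$ by $b_i$, one-by-one from the last to the first, the same way we did it for $X_n$. ■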

Lemma 2 follows from the following concentration result
for independent Bernoulli variables that is based on the method
of types in information theory [8]. Its proof can be found in

ma, im. 

Lemma 13: Let $Y_1, \dots, Y_n$ be i.i.d. Bernoulli random variables, such that $\mathbb{E}[Y_i] = b$. Then:

$$\mathbb{E}\left[e^{n\,\mathrm{kl}\left(\frac{1}{n} \sum_{i=1}^{n} Y_i \,\|\, b\right)}\right] \le n + 1. \tag{18}$$

For $n \ge 8$ it is possible to prove an even stronger result, $\sqrt{n} \le \mathbb{E}\left[e^{n\,\mathrm{kl}\left(\frac{1}{n} \sum_{i=1}^{n} Y_i \,\|\, b\right)}\right] \le 2\sqrt{n}$, using Stirling's approximation of the factorial [19]. For the sake of simplicity we restrict ourselves to the slightly weaker bound (18), although all results that are based on Lemma 2 can be slightly improved by using the tighter bound.
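For i.i.d. Bernoulli variables the expectation in (18) can be evaluated exactly, which gives a quick check of both bounds. The sketch below relies on the following observation, which is our own rearrangement and assumes $0 < b < 1$: since $e^{n\,\mathrm{kl}(k/n\|b)} = \frac{(k/n)^k (1 - k/n)^{n-k}}{b^k (1 - b)^{n-k}}$, multiplying by the binomial probability of $\sum_i Y_i = k$ cancels the dependence on $b$. The function name is hypothetical.

```python
import math

def exp_n_kl_moment(n):
    """E[exp(n * kl(S_n / n || b))] for S_n ~ Binomial(n, b), 0 < b < 1.

    The factor b^k (1 - b)^(n - k) of the binomial probability cancels with
    the denominator of exp(n * kl), so the value does not depend on b.
    """
    return sum(
        math.comb(n, k) * (k / n) ** k * (1.0 - k / n) ** (n - k)
        for k in range(n + 1)
    )

# Compare the exact value with sqrt(n), 2 * sqrt(n), and n + 1.
for n in (8, 20, 100, 500):
    value = exp_n_kl_moment(n)
    print(n, math.sqrt(n), value, 2.0 * math.sqrt(n), n + 1)
```

Each summand is at most 1, which is the counting argument behind the bound $n + 1$ in (18).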

Proof of Lemma 2: Since KL-divergence is a convex function [8] and the exponent function is convex and non-decreasing, $e^{n\,\mathrm{kl}(p\|q)}$ is also a convex function. Therefore, Lemma 2 follows from Lemma 13 by Lemma 1. ■

Corollary 3 follows from Lemma 2 by Markov's inequality.

Lemma 14 (Markov's inequality): For $\delta \in (0, 1)$ and a random variable $X \ge 0$, with probability greater than $1 - \delta$:

$$X \le \frac{1}{\delta} \mathbb{E}[X]. \tag{19}$$

Proof of Corollary 3: By Markov's inequality and Lemma 2, with probability greater than $1 - \delta$:

$$e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)} \le \frac{1}{\delta} \mathbb{E}\left[e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)}\right] \le \frac{n + 1}{\delta}.$$

Taking the logarithm of both sides of the inequality and normalizing by $n$ completes the proof. ■

Appendix B 
Proofs of PAC-Bayesian Theorems for 
Martingales 

In this appendix we provide the proofs of Theorems 4, 7, and 8. The proof of Theorem 5 is very similar to the proof of Theorem 7 and is, therefore, omitted. The proof of Theorem 6 is very similar to the proof of Theorem 8, so we only show how to choose the grid of $\lambda$-s in this theorem.

The proofs of all PAC-Bayesian theorems are based on the following lemma, which is obtained by changing sides in Donsker-Varadhan's variational definition of relative entropy. The lemma has its roots in information theory and statistical physics [5], [6], [7]. The lemma provides a deterministic relation between averages of $\phi$ with respect to all possible distributions $\rho$ and the cumulant generating function $\ln \langle e^{\phi}, \pi \rangle$ with respect to a single reference distribution $\pi$. A single application of Markov's inequality combined with the bounds on moment generating functions in Lemmas 2, 9, and 11 is then used in order to bound the last term in (20) in the proofs of Theorems 4, 5, and 7, respectively.



Lemma 15 (Change of Measure Inequality): For any probability space $(\mathcal{H}, \mathcal{B})$, a measurable function $\phi : \mathcal{H} \to \mathbb{R}$, and any distributions $\pi$ and $\rho$ over $\mathcal{H}$, we have:

$$\langle \phi, \rho \rangle \le \mathrm{KL}(\rho\|\pi) + \ln \langle e^{\phi}, \pi \rangle. \tag{20}$$

Since the KL-divergence is infinite when the support of $\rho$ exceeds the support of $\pi$, inequality (20) is interesting when $\pi \gg \rho$. For a similar reason, it is interesting only when $\langle e^{\phi}, \pi \rangle$ is finite. We note that the inequality is tight in the same sense as Jensen's inequality is tight: for $\phi(h) = \ln \frac{\rho(h)}{\pi(h)}$ it becomes an equality.

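Proof of Theorem 4: Take $\phi(h) := n\,\mathrm{kl}\left(\frac{1}{n} S_n(h) \| b(h)\right)$. More compactly, denote $\phi = n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right) : \mathcal{H} \to \mathbb{R}$. Then with probability greater than $1 - \delta$ for all $\rho$:

$$\begin{aligned}
n\,\mathrm{kl}\left( \left\langle \tfrac{1}{n} S_n, \rho \right\rangle \Big\| \langle b, \rho \rangle \right) &\le n \left\langle \mathrm{kl}\left( \tfrac{1}{n} S_n \| b \right), \rho \right\rangle && (21) \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \left\langle e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)}, \pi \right\rangle && (22) \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \left( \frac{1}{\delta} \mathbb{E}_{X_1^n}\left[ \left\langle e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)}, \pi \right\rangle \right] \right) && (23) \\
&= \mathrm{KL}(\rho\|\pi) + \ln \left( \frac{1}{\delta} \left\langle \mathbb{E}_{X_1^n}\left[ e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)} \right], \pi \right\rangle \right) && (24) \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \frac{n + 1}{\delta}, && (25)
\end{aligned}$$

where (21) is by convexity of the kl divergence [8]; (22) is by the change of measure inequality (Lemma 15); (23) holds with probability greater than $1 - \delta$ by Markov's inequality; in (24) we can take the expectation inside the dot product due to linearity of both operations and since $\pi$ is deterministic; and (25) is by Lemma 2.$^2$ Normalization by $n$ completes the proof of the theorem. ■

$^2$By Lemma 2, for each $h \in \mathcal{H}$ we have $\mathbb{E}_{X_1^n}\left[ e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n(h) \| b(h)\right)} \right] \le n + 1$ and, therefore, $\left\langle \mathbb{E}_{X_1^n}\left[ e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \| b\right)} \right], \pi \right\rangle \le n + 1$.

Proof of Theorem 7: For the proof of Theorem 7 we take $\phi(h) := \lambda M_n(h) - (e - 2) \lambda^2 V_n(h)$. Or, more compactly, $\phi = \lambda M_n - (e - 2) \lambda^2 V_n$. Then with probability greater than $1 - \frac{\delta}{2}$ for all $\rho$:

$$\begin{aligned}
\lambda \langle M_n, \rho \rangle - (e - 2) \lambda^2 \langle V_n, \rho \rangle &= \left\langle \lambda M_n - (e - 2) \lambda^2 V_n, \rho \right\rangle \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \left\langle e^{\lambda M_n - (e - 2) \lambda^2 V_n}, \pi \right\rangle \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \left( \frac{2}{\delta} \mathbb{E}_{Z_1^n}\left[ \left\langle e^{\lambda M_n - (e - 2) \lambda^2 V_n}, \pi \right\rangle \right] \right) && (26) \\
&= \mathrm{KL}(\rho\|\pi) + \ln \left( \frac{2}{\delta} \left\langle \mathbb{E}_{Z_1^n}\left[ e^{\lambda M_n - (e - 2) \lambda^2 V_n} \right], \pi \right\rangle \right) \\
&\le \mathrm{KL}(\rho\|\pi) + \ln \frac{2}{\delta}, && (27)
\end{aligned}$$

where (27) is by Lemma 11 and other steps are justified in the same way as in the previous proof.

By applying the same argument to $-M_n$, taking a union bound over the two results, taking $(e - 2) \lambda^2 \langle V_n, \rho \rangle$ to the other side of the inequality, and normalizing by $\lambda$, we obtain the statement of the theorem. ■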
Proof of Theorem 8: The value of $\lambda$ that minimizes (9) depends on $\rho$, whereas we would like to have a result that holds for all possible distributions $\rho$ simultaneously. This requires considering multiple values of $\lambda$ simultaneously and we have to take a union bound over $\lambda$-s in step (26) of the proof of Theorem 7. We cannot take all possible values of $\lambda$, since there are uncountably many possibilities. Instead we determine the relevant range of $\lambda$ and take a union bound over a grid of $\lambda$-s that forms a geometric sequence over this range. Since the range is finite, the grid is also finite.

The upper bound on the relevant range of $\lambda$ is determined by the constraint that $\lambda \le \frac{1}{K}$. For the lower bound we note that since $\mathrm{KL}(\rho\|\pi) \ge 0$, the value of $\lambda$ that minimizes (9) is lower bounded by $\sqrt{\frac{\ln \frac{2}{\delta}}{(e - 2) \langle V_n, \rho \rangle}}$. We also note that $\langle V_n, \rho \rangle \le K^2 n$, since $|Z_i(h)| \le K$ for all $h$ and $i$. Hence, $\lambda \ge \frac{1}{K} \sqrt{\frac{\ln \frac{2}{\delta}}{(e - 2) n}}$ and the range of $\lambda$ we are interested in is

$$\lambda \in \left[ \frac{1}{K} \sqrt{\frac{\ln \frac{2}{\delta}}{(e - 2) n}}, \; \frac{1}{K} \right].$$

We cover the above range with a grid of $\lambda_i$-s, such that $\lambda_i := c^i \frac{1}{K} \sqrt{\frac{\ln \frac{2}{\delta}}{(e - 2) n}}$ for $i = 0, \dots, m - 1$. It is easy to see that in order to cover the interval of relevant $\lambda$ we need

$$m = \left\lceil \frac{1}{\ln c} \ln \left( \sqrt{\frac{(e - 2) n}{\ln \frac{2}{\delta}}} \right) \right\rceil$$

($\lambda_{m-1}$ is the last value that is strictly less than $1/K$ and we take $\lambda_m := 1/K$ for the case when the technical condition (10) is not satisfied). This defines the value of $\nu$ in (12), that is, $\nu = m + 1$.

Finally, we note that (9) has the form $g(\lambda) = \frac{U}{\lambda} + \lambda V$. In the relevant range of $\lambda$, there is $\lambda_{i^*}$ that satisfies $\sqrt{U/V} \le \lambda_{i^*} < c \sqrt{U/V}$. For this value of $\lambda$ we have $g(\lambda_{i^*}) \le (1 + c) \sqrt{UV}$.

Therefore, whenever (10) is satisfied we pick the highest value of $\lambda_i$ that does not exceed the left hand side of (10), substitute it into (9), and obtain (11), where the $\ln \nu$ factor comes from the union bound over $\lambda_i$-s. If (10) is not satisfied, we know that $\langle V_n, \rho \rangle \le K^2 \left( \mathrm{KL}(\rho\|\pi) + \ln \frac{2\nu}{\delta} \right) / (e - 2)$ and by taking $\lambda = 1/K$ and substituting into (9) we obtain (13). ■
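The grid construction in the proof above can be sketched in a few lines. This is an illustrative sketch, not code from the paper: the function names and the example parameters ($n = 1000$, $K = 1$, $\delta = 0.05$, $c = 1.5$) are assumptions, and $n$ is assumed large enough that $\ln\frac{2}{\delta} < (e - 2) n$, so that the range of $\lambda$ is non-empty.

```python
import math

def grid_size_nu(n, delta, c):
    """Number of grid points nu, as in (12)."""
    ratio = math.sqrt((math.e - 2.0) * n / math.log(2.0 / delta))
    return math.ceil(math.log(ratio) / math.log(c)) + 1

def bernstein_lambda_grid(n, K, delta, c):
    """Geometric grid lambda_0, ..., lambda_{m-1} followed by lambda_m = 1 / K."""
    nu = grid_size_nu(n, delta, c)
    lam_min = math.sqrt(math.log(2.0 / delta) / ((math.e - 2.0) * n)) / K
    return [lam_min * c ** i for i in range(nu - 1)] + [1.0 / K]

def pick_lambda(grid, target):
    """Largest grid value not exceeding target (smallest grid value as fallback)."""
    candidates = [lam for lam in grid if lam <= target]
    return max(candidates) if candidates else grid[0]

grid = bernstein_lambda_grid(n=1000, K=1.0, delta=0.05, c=1.5)

# A target inside the covered range is matched from below within a factor of c.
chosen = pick_lambda(grid, 0.3)
```

Because consecutive grid points differ by a factor of at most $c$, the chosen $\lambda$ is within a factor $c$ of the minimizer, which is the source of the $(1 + c)$ factor in (11).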



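Proof of Theorem 6: Theorem 6 follows from Theorem 5 in the same way as Theorem 8 follows from Theorem 7. The only difference is that the relevant range of $\lambda$ is unlimited from above. If $\mathrm{KL}(\rho\|\pi) = 0$ the bound is minimized by

$$\lambda = \sqrt{\frac{8 \ln \frac{2}{\delta}}{\sum_{i=1}^{n} (\beta_i - \alpha_i)^2}},$$

hence, we are interested in $\lambda$ that is larger than or equal to this value. We take a grid of $\lambda_i$-s of the form

$$\lambda_i := c^i \sqrt{\frac{8 \ln \frac{2}{\delta}}{\sum_{i=1}^{n} (\beta_i - \alpha_i)^2}}$$

for $i \ge 0$. Then for a given value of $\mathrm{KL}(\rho\|\pi)$ we have to pick $\lambda_i$, such that

$$i = \left\lfloor \frac{\ln \left( \frac{\mathrm{KL}(\rho\|\pi)}{\ln \frac{2}{\delta}} + 1 \right)}{2 \ln c} \right\rfloor,$$

where $\lfloor x \rfloor$ is the largest integer value that is smaller than $x$. Taking a weighted union bound over $\lambda_i$-s with weights $2^{-(i+1)}$ completes the proof. (In the weighted union bound we take $\delta_i = \delta 2^{-(i+1)}$. Then by substitution of $\delta$ with $\delta_i$, (7) holds with probability greater than $1 - \delta_i$ for each $\lambda_i$ individually, and with probability greater than $1 - \sum_{i=0}^{\infty} \delta_i = 1 - \delta$ for all $\lambda_i$ simultaneously.) ■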

Appendix C 
Background 



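In this section we provide a proof of Lemma 11. The proof reproduces an intermediate step in the proof of [21, Theorem 1].

Proof of Lemma 11: First, we have:

$$\begin{aligned}
\mathbb{E}_{Z_i}\left[ e^{\lambda Z_i} \,\big|\, Z_1^{i-1} \right] &\le \mathbb{E}_{Z_i}\left[ 1 + \lambda Z_i + (e - 2) \lambda^2 (Z_i)^2 \,\big|\, Z_1^{i-1} \right] && (28) \\
&= 1 + (e - 2) \lambda^2 \mathbb{E}_{Z_i}\left[ (Z_i)^2 \,\big|\, Z_1^{i-1} \right] && (29) \\
&\le e^{(e - 2) \lambda^2 \mathbb{E}_{Z_i}\left[ (Z_i)^2 | Z_1^{i-1} \right]}, && (30)
\end{aligned}$$

where (28) uses the fact that $e^x \le 1 + x + (e - 2) x^2$ for $x \le 1$ (this restricts the choice of $\lambda$ to $\lambda \le \frac{1}{K}$, which leads to technical conditions (10) and (15) in Theorem 8 and Corollary 12, respectively); (29) uses the martingale property $\mathbb{E}_{Z_i}[Z_i | Z_1^{i-1}] = 0$; and (30) uses the fact that $1 + x \le e^x$ for all $x$.

We apply inequality (30) in the following way:

$$\begin{aligned}
\mathbb{E}_{Z_1^n}\left[ e^{\lambda M_n - (e - 2) \lambda^2 V_n} \right] &= \mathbb{E}_{Z_1^n}\left[ e^{\lambda M_{n-1} - (e - 2) \lambda^2 V_{n-1} + \lambda Z_n - (e - 2) \lambda^2 \mathbb{E}\left[ (Z_n)^2 | Z_1^{n-1} \right]} \right] \\
&= \mathbb{E}_{Z_1^{n-1}}\left[ e^{\lambda M_{n-1} - (e - 2) \lambda^2 V_{n-1}} \times \mathbb{E}_{Z_n}\left[ e^{\lambda Z_n} \,\big|\, Z_1^{n-1} \right] \times e^{-(e - 2) \lambda^2 \mathbb{E}\left[ (Z_n)^2 | Z_1^{n-1} \right]} \right] \\
&\le \mathbb{E}_{Z_1^{n-1}}\left[ e^{\lambda M_{n-1} - (e - 2) \lambda^2 V_{n-1}} \right] && (31) \\
&\le \dots && (32) \\
&\le 1.
\end{aligned}$$

Inequality (31) applies inequality (30) and inequality (32) recursively proceeds with $Z_{n-1}, \dots, Z_1$ (in reverse order). ■

Note that conditioning on additional variables in the proof of the lemma does not change the result. This fact is exploited in the proof of Theorem 7, when we allow interdependence between multiple martingales.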
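The elementary inequality $e^x \le 1 + x + (e - 2) x^2$ for $x \le 1$, used in step (28), is what forces the restriction $\lambda \le 1/K$. The following sketch checks it on a grid and shows that it fails for $x > 1$; the grid range $[-10, 1]$ and the function name are arbitrary illustrative choices.

```python
import math

def quadratic_exp_bound(x):
    """Upper bound 1 + x + (e - 2) x^2 on exp(x), valid for x <= 1."""
    return 1.0 + x + (math.e - 2.0) * x * x

# Largest violation exp(x) - bound(x) over a grid on [-10, 1]; it should not be
# positive beyond floating-point error (equality holds at x = 0 and x = 1).
xs = [-10.0 + 11.0 * i / 10000 for i in range(10001)]
worst_violation = max(math.exp(x) - quadratic_exp_bound(x) for x in xs)

# Beyond x = 1 the bound no longer holds, hence the requirement lambda * K <= 1.
fails_beyond_one = math.exp(1.5) > quadratic_exp_bound(1.5)
```

With $|Z_i| \le K$ and $\lambda \le 1/K$ the argument $x = \lambda Z_i$ always satisfies $x \le 1$, so step (28) applies.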